Sequence similarity search programs are versatile tools for the molecular biologist, 
frequently able to identify possible DNA coding regions and to provide clues to gene and 
protein structure and function. While much attention has been paid to the precise 
algorithms these programs employ and to their relative speeds, there is a constellation 
of associated issues that are equally important to realize the full potential of these 
methods. Here, we consider a number of these issues, including the choice of scoring 
systems, the statistical significance of alignments, the masking of uninformative or 
potentially confounding sequence regions, the nature and extent of sequence 
redundancy in the databases and network access to similarity search services. 
The advent of rapid DNA sequencing technology in the 
mid-1970s led to an information explosion that continues 
unabated today. Molecular sequence data have become 
the common currency of biomedical research and often 
provide unexpected links among diverse biological 
systems. These connections accelerate research progress 
and may even open up entirely new fields of inquiry. One 
approach to discovering such connections, database 
"homology" searching, has been executed countless times, 
often with surprising results and has become an essential 
method for the molecular biologist. While the particular 
algorithm used is of course important, the effectiveness of 
database searches is dependent as well on a large number 
of correlative factors, many of which tend to be overlooked 
or dealt with in an inefficient or ad hoc manner. These 
include the following: 

Scoring systems. Most database search algorithms rank 
alignments by a score, whose calculation is dependent 
upon a particular scoring system. Usually there is a default 
system, but it may not be ideal for a user's particular 
problem. For example, haemoglobin subunits used to be 
regarded as "typical" proteins and are often still used as 
benchmark query sequences for evaluating new database 
search techniques and scoring systems. However, today it 
is more common to encounter much larger and more 
complex sequences (see below), and methods developed 
and optimized for small, uniformly conserved, single-domain 
proteins are inadequate. Scores that are best for 
detecting similarities between greatly diverged sequences 
differ from those best for detecting short but nearly 
identical segments^*^. Optimal strategies for detecting 
similarities between DNA protein-coding regions differ 
from those for non-coding regions^-*. Special scoring 
systems for detecting frame-shift errors in the databases 
have recently been described^. A database search program 
should therefore make a variety of scoring systems available 
and users should be aware of which ones are best suited to 
their problems. 

Alignment statistics. Given a query sequence, most 
database search programs will produce an ordered list of 
imperfectly matching database similarities, but none of 
them need have any biological significance. An important 
question is how strong a similarity is necessary to be 
considered surprising. United by a common theory, a 
number of analytic*"^ and empirical results^'^*'"^' are now 
available for assessing database search results. However, 
one still sees occasional extravagant claims in the literature, 
usually springing either from misapplication of the normal 
distribution or from an absence of critical statistical 
analysis. 

Databases. The use of an up-to-date sequence database is 
clearly a vital element of any similarity search. Sequence 
relationships critical to important discoveries have on 
occasion been missed because old or incomplete databases 
were employed. However, the variety of databases available, 
and their overlapping coverage, has the potential to render 
similarity searching cumbersome and inefficient. This no 
longer need be the case. Timely access to complete and 
"nonredundant" sequence databases has become relatively 
simple and inexpensive. 

Database redundancy and sequence repetitiveness. 
Surprisingly strong biases exist in protein and nucleic acid 
sequences and sequence databases. Many of these reflect 
fundamental mosaic sequence properties that are of 
considerable biological interest in themselves, such as 
segments of low compositional complexity or short-period 
Table 1 The BLAST family of programs

Program^a   Query sequence            Database sequences        Comments

BLASTP      protein                   protein                   • Default scoring matrix^b is BLOSUM62; change with
                                                                  command line option "M=PAM250", for example
                                                                • Low-complexity masking with "-filter" option;
                                                                  choice of either the SEG^^ or XNU^"* algorithm

BLASTN      nucleotide                nucleotide                • Parameters optimized for speed, not sensitivity; not
            (both strands)                                        intended for finding distantly-related coding sequences
                                                                • Automatically checks complementary strand of query

BLASTX      nucleotide                protein                   • Very useful for preliminary data containing
            (six-frame translation)                               potential frameshift errors*
                                                                • Nine different genetic codes available^c; change with
                                                                  command line "C=1" (vertebrate mitochondrial), for example
                                                                • Low-complexity filter option as for BLASTP

TBLASTN     protein                   nucleotide                • Essential for searching protein queries against dbEST^
                                      (six-frame translations)  • Often useful for finding undocumented open reading
                                                                  frames or frameshift errors in database sequences
                                                                • Same genetic code options as for BLASTX

^a These programs are available through the BLAST Network and e-mail servers (see 
text) and the source codes are available by anonymous ftp on ncbi.nlm.nih.gov. 
^b More than 65 different PAM^^^-^-^, BLOSUM*^'*^ and other scoring matrices are 
available. PAM120 or BLOSUM62 are best for general purposes but a useful 
combination for detecting strong and short to long and weak similarities consists of 
PAM30, PAM120 and PAM250 (ref. 2). 
^c Default genetic code (C=0) is "standard" or "universal" code. Other codes available 
include: 1, Vertebrate mitochondrial; 2, Yeast mitochondrial; 3, Mold mitochondrial 
and mycoplasma; 4, Invertebrate mitochondrial; 5, Ciliate macronuclear; 6, Protozoan 
mitochondrial; 7, Plant mitochondrial; and 8, Echinodermate mitochondrial. 

repeats. Databases also contain some very large families of 
related domains, motifs or repeated sequences, in some 
cases with hundreds of members. In other cases there has 
been a historical bias in the molecules that have been 
chosen for sequencing. In practice, unless special measures 
are taken, these biases very commonly confound database 
search methods and interfere with the discovery of 
interesting new sequence similarities. Problems include 
the occurrence of misleading, spuriously-high scores, 
ambiguities in the phase of sequence alignments and 
overwhelmingly large output lists in which interesting 
results may be inconspicuously buried. We shall describe 
some recently developed methods that largely solve these 
problems by automatically detecting and masking 
potentially confounding subsequences. 

Failure to deal properly with the factors described above 
can result in chance similarities being claimed significant, 
or biologically important relationships being overlooked. 
Here, we shall discuss these and several other issues in 
database searching. While we will frequently use the BLAST 
programs^" (Table 1) as examples, most of the questions 
considered have quite general relevance. 

Algorithms and programs 

The earliest sequence comparison studies focussed on the 
alignment of complete sequences^^"^^. However, with the 
recognition that proteins frequently share only isolated 
regions of similarity, corresponding for instance to 
structural motifs or active sites, attention shifted to 
algorithms for local alignment^®"^^. Essentially all database 
search methods have been based upon measures of local 
sequence similarity. 

In general, local alignments are assessed by means of a 
score, which is computed as the sum of scores for aligned 
pairs of residues and scores for gaps^^. How these scores 
are chosen, and what they signify, is discussed below. The 
time necessary to find alignments that optimize such 
scores is sufficiently great that, for most practical purposes, 
either parallel architecture machines^^"^^ or heuristic 
methods such as Fasta^^-^* are required. The problem may 
be simplified by forbidding gaps. This leads to faster 
heuristic methods such as the BLAST algorithms''^"* (Table 
1), as well as to efficient hardware implementations^^. 
While some sensitivity to weak similarities may be lost by 
eschewing gaps^°, easier generalization^^ and rigorous 
statistical results^ become available. Alternatively, local 
alignments may be assessed in a more sophisticated manner 
than by the simple sum of substitution and gap scores^^. 
This may lead to more sensitive detection of weak 
similarities, but at the price of greatly increased 
computation time^^. 
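
As a rough illustration of the additive score described above, the following Python sketch computes the best local alignment score with the Smith-Waterman recurrence. The match, mismatch and gap values, the linear gap cost and the function name are assumptions chosen for illustration; they are not the parameters of any program discussed here.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (Smith-Waterman).

    The score is the sum of substitution scores for aligned pairs plus a
    linear cost per gap position. All scoring values are illustrative.
    """
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                prev[j] + gap,        # a[i-1] aligned to a gap
                cur[j - 1] + gap,     # b[j-1] aligned to a gap
            )
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(local_alignment_score("ACACACTA", "AGCACACA"))
```

Setting `gap` to a very large negative value effectively forbids gaps, which corresponds to the simplification mentioned above. The quadratic running time of this recurrence is what motivates the heuristic and hardware approaches.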

In general, the relevant considerations in choosing a 
particular algorithm are hardware requirements, speed 
and sensitivity to biological relationships. The tensions 
between these competing claims are resolved variously by 
programs such as Fasta^*, BLAST'"^ and Blaze^^. The relative 
merits of these and the other programs have been discussed 
at length elsewhere^'*'^. The idea of optimizing a measure 
of local similarity is common to virtually all popular 
programs, and the results they produce therefore do not 
differ in any truly essential way. 

Local alignment statistics 

Not all biologically important sequence relationships will 
be detected by sequence similarity search programs and, 
even when found, they may be lost among irrelevant or 
chance similarities. While experiment is the ultimate 
arbiter of biological significance, mathematical analysis 
can indicate which similarities are unlikely to have arisen 
by chance and therefore merit special attention. Thus an 
important question concerning alignments produced by 
any database search is whether they can be considered 
statistically significant. 



Fig. 1 The probability density function of the extreme value 
distribution with characteristic value u=0 and decay 
constant λ=1. 
One approach sometimes taken is to record an optimal 
local alignment score for each database sequence and 
then to report these scores as standard deviations from 
the mean. There are several serious and frequently 
unrecognized pitfalls to this procedure. First, the optimal 
scores for the comparison of a query sequence to different 
database sequences cannot be assumed to be drawn from 
the same distribution. The longer a given database 
sequence, the greater the score expected by chance. Also, 
variation in residue composition among sequences can 
yield different score distributions. Second, unless a 
rigorous optimization algorithm is employed, the true 



Box 1 The extreme value distribution and local sequence similarities 

Just as the sum of many independent random variables results naturally in a normal distribution, the maximum of 
many independent random variables yields an extreme value distribution^*. (For rigour, this statement must be 
qualified in many ways, but we will omit the technicalities here.) Because the score of an optimal local alignment is, 
for practical purposes, the maximum of many essentially independent alignment scores, the extreme value 
distribution plays a central role in the statistics of local sequence alignments. This distribution may be described by 
two parameters, the characteristic value, u, and the decay constant, λ; the probability of observing a score greater 
than or equal to x purely by chance is given by the formula 

$$1 - \exp\left(-e^{-\lambda(x-u)}\right)$$

The probability density of the standard extreme value distribution, with u=0 and λ=1, is shown in Fig. 1. For random 
sequences, the maximal segment pair scores used by the BLAST algorithms'*"*'^^ can be shown to obey an extreme 
value distribution^'^. While analysis is not available for the scores of alignments with gaps, experiment^^^^ and 
analogy^'"^'^®^^ strongly suggest that they too should obey this type of distribution. 

In order to use the formula above, one needs to estimate the relevant parameters u and λ for a given sequence 
comparison. These will, in general, depend upon the composition and length of the sequences being compared, and 
upon the particular scoring system used. For alignments with gaps, the parameters may be estimated by random 
simulation^^^ or by examining optimal local alignment scores from unrelated sequences^^-^^. For ungapped 
alignments, the parameters may be calculated directly^. In this case, the parameter u may be written as 

$$u = \frac{\ln Kmn}{\lambda}$$

where m and n are the sequences' lengths and K and λ may be calculated from the substitution scores and 
sequence compositions^"*. 

We have described how to calculate the probability, p, that a given local-alignment score would arise from the 
comparison of two random sequences. This probability must be adjusted for the multiple comparisons performed in 
a database search (see text). The applicable Poisson distribution implies that the probability of observing at least one 
alignment with pairwise p-value p from a search of a database containing D sequences may be estimated as 

$$P = 1 - e^{-Dp}$$

When P<0.1, it may be well approximated as simply Dp. This approach makes the implicit assumption that all 
sequences in the database are a priori equally likely to share some relationship with the query. An alternative view, 
based on the idea that many proteins possess multiple domains, is that all equal-length protein segments in the 
database are a priori equally likely to be related to the query. This approach implies a different normalization. Assume 
that the alignment of interest involves a database sequence of length n residues, and that the complete database has 
N residues. Then, in the equation above, D should be replaced by N/n. This is the default normalization currently 
employed by the BLAST programs. (In the context of DNA as opposed to protein database searches, it is the only 
normalization that really makes sense.) Reasons for calculating significance in the context of pairwise protein 
comparisons in the first place, rather than sequence-database comparisons, are to allow for multiple high-scoring 
alignments and for protein compositional heterogeneity. 
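
The calculation described so far can be sketched in a few lines of Python. This is a minimal illustration rather than the code used by any BLAST program: the function names are ours, and the values of K, λ, the sequence lengths and the database size in the example are hypothetical.

```python
import math


def characteristic_value(K, lam, m, n):
    """u = ln(K*m*n) / lambda, for ungapped alignments of sequences of length m and n."""
    return math.log(K * m * n) / lam


def pairwise_p_value(score, u, lam):
    """Probability of a score >= `score` by chance: 1 - exp(-exp(-lambda*(score - u)))."""
    return -math.expm1(-math.exp(-lam * (score - u)))


def database_p_value(p, trials):
    """Probability of at least one such alignment in `trials` comparisons: 1 - exp(-trials*p)."""
    return -math.expm1(-trials * p)


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to show the order of the steps.
    K, lam = 0.13, 0.32
    m, n = 250, 400            # query and database sequence lengths
    N = 10_000_000             # total residues in the database
    u = characteristic_value(K, lam, m, n)
    p = pairwise_p_value(60, u, lam)
    print(p, database_p_value(p, N / n))   # N/n replaces D, as described above
```

`math.expm1` is used so that very small probabilities are not lost to rounding when 1 - exp(-y) is evaluated for tiny y.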

The BLAST programs^'*^ (Table 1) may generate several high-scoring alignments for a given pair of sequences. 
While the significance of these alignments may be assessed individually, it is frequently of value to construct a 
combined assessment. One method uses the fact that the number of segment pairs expected by chance to have 
score at least x is approximately Poisson distributed, with parameter $e^{-\lambda(x-u)}$ (refs 6-8). Thus, if three distinct segment 
pairs with scores 50, 45 and 40 are found in a given pairwise comparison, one may calculate the probability p that at 
least three pairs, all with score at least 40, would appear by chance. This approach has the weakness of depending 
upon only the lowest among the r greatest scores. Alternatively, one may calculate the sum $S_r$ of the r highest scores. 
The random distribution of such sums has been derived and the appropriate tail probability is available numerically 
as a double integral. 

The BLAST programs currently use the former, Poisson method, of assessing multiple high-scoring segment pairs. 
Not all sets of segment pairs, however, warrant a joint assessment. Only when such a set may be combined into a 
consistent, gapped alignment is it really appropriate to consider the separate segment pairs as parts of a greater 
whole. Accordingly, as a default, the BLAST programs require such consistency before calculating a joint statistical 
assessment. The imposition of such consistency has the further advantage of sharpening the joint statistics^. 

The problem of multiple tests arises again in using either the Poisson or sum p-values described above. For 
example, while the probability for finding at least three segment pairs with score at least 40 may be valid, in practice 
one has considered as well the single best segment pair in isolation, the two best segment pairs, etc. These multiple 
tests can result in too optimistic a significance claim for the best overall result. P. Green (personal communication) 
has suggested a simple solution to this difficulty: dividing the p-value for a result involving r segment pairs by the 
factor $(1-a)a^{r-1}$, where a is a constant between 0 and 1, yields a conservative p-value for the multiple tests. The 
parameter a can be viewed as a "gap penalty." Choosing a near 0 greatly favours results involving a single segment 
pair. Choosing a near 1 favours results with fewer segment pairs only slightly, but may underestimate significance 
because of the actual non-independence of the multiple tests. The p-values reported by the BLAST programs 
implement this multiple test discount procedure, with a default of a=0.5. □ 
optimal pairwise scores will be systematically 
underestimated and the shape of the true distribution 
will be ill-determined. Third, comparing a query sequence 
to a set of uniform length random sequences yields scores 
that obey not a normal but an extreme value distribution 
(Box 1 and Fig. 1). The tail of this distribution decays 
exponentially in x rather than x^2, so assuming normality 
tends grossly to exaggerate an alignment's significance. 
Finally, a database search involves many essentially 
independent trials. If the database contains 50,000 
sequences, a score with probability 10^ of arising from a 
single comparison is only marginally significant in the 
context of the complete search. The last two points alone 
imply that an alignment may easily achieve a score over 
ten standard deviations from the mean yet fail to be 
statistically significant. 
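
The last two points can be checked numerically. The sketch below is an illustration under simplifying assumptions: scores are taken to follow the standard extreme value distribution (u=0, λ=1), whose mean is the Euler-Mascheroni constant and whose standard deviation is π/√6, and the 50,000 comparisons are treated as independent. The function names are ours.

```python
import math

EULER_GAMMA = 0.5772156649015329   # mean of the standard extreme value distribution


def normal_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def extreme_value_tail_in_sd(z):
    """P(X >= mean + z*sd) for the extreme value distribution with u=0, lambda=1."""
    sd = math.pi / math.sqrt(6.0)
    x = EULER_GAMMA + z * sd
    return -math.expm1(-math.exp(-x))


def search_p_value(p, n_sequences):
    """Probability of at least one such score among n_sequences independent comparisons."""
    return -math.expm1(-n_sequences * p)


if __name__ == "__main__":
    p_normal = normal_tail(10.0)
    p_evd = extreme_value_tail_in_sd(10.0)
    print(p_normal, p_evd, search_p_value(p_evd, 50_000))
```

Under these assumptions a score ten standard deviations above the mean has a single-comparison probability on the order of 10^-6 rather than the roughly 10^-23 that a normal distribution would suggest, and after allowing for 50,000 comparisons the search-level probability is about 0.07.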

Box 1 discusses the extreme value distribution and how 
it may be used to calculate the probability that a gap-free 
local alignment with a given score would arise from the 
comparison of two random sequences. It also describes 
how to modify this probability to account for the "multiple 
tests" of a database search. Such a search can itself generate 
data which provide an alternative to the analytic method 
(Box 1) for estimating alignment statistical significance^^. 
For a given query, one records the best alignment score to 
each database sequence. If score S is observed f(S) times, 
then plotting log f(S) versus S tends to produce a straight 
line; extrapolation of this line can yield estimates of 
statistical significance^^. 
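
One way to carry out such an extrapolation is sketched below in Python. It is an assumption-laden illustration, not a published procedure: a least-squares line is fitted to log f(S) for integer scores, and the fitted line is summed as a geometric series to give the number of database sequences expected to reach a given score by chance. The function names and the synthetic histogram are hypothetical.

```python
import math


def fit_log_linear(histogram):
    """Least-squares fit of log f(S) = a + b*S over scores with f(S) > 0.

    `histogram` maps each score S to f(S), the number of database
    sequences whose best alignment has that score. Returns (a, b).
    """
    pts = [(s, math.log(f)) for s, f in histogram.items() if f > 0]
    n = len(pts)
    mean_s = sum(s for s, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((s - mean_s) ** 2 for s, _ in pts)
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in pts)
    b = sxy / sxx
    a = mean_y - b * mean_s
    return a, b


def expected_count_at_least(score, a, b):
    """Extrapolated number of sequences with integer score >= `score` (requires b < 0)."""
    # Geometric series: sum over S >= score of exp(a + b*S).
    return math.exp(a + b * score) / (1.0 - math.exp(b))


if __name__ == "__main__":
    # Synthetic, exactly exponential counts used only to exercise the code.
    hist = {s: 1000.0 * math.exp(-0.3 * (s - 20)) for s in range(20, 40)}
    a, b = fit_log_linear(hist)
    print(b, expected_count_at_least(60, a, b))
```

In practice the fit should be restricted to the part of the histogram dominated by unrelated sequences, since scores from true relatives of the query would distort the line.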

One advantage of this approach is that it is applicable 
to cases for which no rigorous theory is available, such as 
scores from gapped alignments. Thus heuristic programs 
such as Fasta^*, or parallel implementations of the Smith-Waterman 
algorithm^^ such as Blaze^^ or Blitz^^ can 
estimate statistical significance using this method. 
Furthermore, because the scores generated derive from 
comparisons of real sequences, no "random protein" 
model is needed. A disadvantage of the method is the 
need to generate optimal alignment scores for a substantial 
fraction of database sequences in order to calculate 
statistical significance. Potential inaccuracy arises from 
variation in database sequence size and composition, 
which implies that each data point is really drawn from 
a separate distribution^■^°■^^. Also, if many sequences 
related to the query are present (see discussion on database 
redundancy below), it may be difficult to base the plotted 
line upon only unrelated sequences. An alternative "curve 
fitting" approach is to estimate the parameters of the 
implicit extreme value distribution for the scoring system 
at hand^'^°'^^'^^. In one form or another, curve fitting will 
generally be necessary to calculate the statistical 
significance of scores derived from gapped alignments or 
other complex scoring systems^'^""^^. 

The most important "failure" of the local alignment 
statistics discussed here is on comparisons of regions with 
restricted or unusual amino acid or nucleotide 
composition. Such regions are quite common in proteins, 
but are clearly not well described by the same random 
model used for other sequence regions (see below). Because 
an alignment of such "low complexity" regions has little 
real meaning, it is best simply to note their existence, but 
exclude them from alignments produced in database 
searches (see Figs 2 and 3 for examples). 

Scoring matrices and gap costs 

Many different amino acid substitution score matrices 
have been proposed over the years for use with sequence 
comparison and database search programs^'^'^^^, and a 
variety of rationales have been used for their construction. 
However, it is possible to show that in the context of 
seeking high-scoring segment pairs without gaps, any 
such matrix has an implicit amino acid pair frequency 
distribution that characterizes the alignments it is 
optimized for finding. More precisely, let p_i be the 
frequency with which amino acid i occurs in protein 
sequences and, within the class of alignments sought, let 
q_ij be the frequency with which amino acids i and j are 
aligned. Then the scores that best distinguish these 
alignments from chance are given by the formula 

$$S_{ij} = \log\frac{q_{ij}}{p_i p_j}$$

The base of the logarithm is arbitrary, affecting only the 
scale of the scores. Any set of scores useful for local 
alignment can be written in this form, so a choice of 
substitution matrix can be viewed as an implicit choice of 
"target frequencies" q_ij (refs 1,6). 
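
The correspondence between scores and target frequencies can be made concrete with a short Python sketch. It relies on a standard result that is not spelled out in the text: for a score matrix with negative expected score and at least one positive score, the implicit target frequencies are q_ij = p_i p_j exp(λ s_ij), where λ is the positive solution of Σ p_i p_j exp(λ s_ij) = 1. The function names, the bisection solver and the example values are our assumptions.

```python
import math


def log_odds_scores(q, p):
    """S_ij = ln(q_ij / (p_i * p_j)) for target frequencies q and background p."""
    return {(i, j): math.log(qij / (p[i] * p[j])) for (i, j), qij in q.items()}


def implicit_scale(scores, p, tol=1e-12):
    """Positive lambda solving sum_ij p_i p_j exp(lambda * s_ij) = 1, by bisection.

    Assumes a negative expected score and at least one positive score.
    """
    def f(lam):
        return sum(p[i] * p[j] * math.exp(lam * s) for (i, j), s in scores.items()) - 1.0

    hi = 1.0
    while f(hi) <= 0.0:          # expand until the positive root is bracketed
        hi *= 2.0
    lo = 0.0                     # f is negative just above the trivial root at 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def target_frequencies(scores, p):
    """Implicit target frequencies q_ij = p_i p_j exp(lambda * s_ij)."""
    lam = implicit_scale(scores, p)
    return {(i, j): p[i] * p[j] * math.exp(lam * s) for (i, j), s in scores.items()}


if __name__ == "__main__":
    # Hypothetical DNA scoring system: +5 for a match, -4 for a mismatch.
    p = {base: 0.25 for base in "ACGT"}
    scores = {(a, b): (5 if a == b else -4) for a in "ACGT" for b in "ACGT"}
    q = target_frequencies(scores, p)
    print(implicit_scale(scores, p), sum(q[(b, b)] for b in "ACGT"))
```

The second printed value is the fraction of identical aligned pairs implied by the scores, which gives a direct reading of the degree of divergence for which a given matrix is tuned.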

The target frequencies characterizing alignments of 
closely related sequences clearly differ from those for 
alignments of sequences that are greatly diverged. 
Therefore a single matrix can not be optimal for 
recognizing relationships at all evolutionary distances^*^'^^. 
It has been argued that for most practical purposes, three 
separate matrices should be adequate for locating all 
alignments containing sufficient information to rise above 
background noise^*^. The question remains how best to 
estimate the appropriate corresponding target frequencies. 

Estimating the frequencies with which the various 
amino acids tend to mutate into one another is a 
necessarily empirical problem. The first approach to the 
question was taken by Dayhoff and coworkers^^-^*. Their 
"PAM" model of molecular evolution allowed target 
frequencies and the corresponding score matrices to be 



Fig. 2 Significant sequence matches of the human MTG8 product: the effect of low-complexity masking. MTG8 (ref. 84) is the translated 
product of a chromosome 8 gene involved in a t(8;21) translocation that results in an AML1-MTG8 fusion transcript in a case of acute 
myeloid leukaemia (GenBank accession number D14820). a, Automated segmentation of low-complexity sequences in MTG8 at relatively 
high stringency. To be defined as low-complexity in this run of the SEG algorithm (Box 2), a sequence region must contain at least one 
12-residue window with complexity (K, Box 2) less than 0.315. SEG then finds the minimally probable (lowest P, Box 2) low-complexity 
subsequence, of any length, within the overlapping windows of this region. The sequence segments read from left to right and their order in 
the polypeptide runs from top to bottom, as shown by the central column of residue numbers. b, The strong match, which emerges clearly 
without masking (Poisson p-value 2.5 x 10^), between sections of MTG8 and Drosophila melanogaster transcription factor TFIID 110-kDa 
subunit85-»«. c, MTG8 filtered as in (a) but with the low-complexity segments masked by "x" characters, for use as a query sequence in 
database searches. d, The significant match between a region of MTG8 containing a cysteine cluster and rat apoptosis protein RP-8. RP-8 
(ref. 87) is a gene expressed early in the process of programmed cell death (apoptosis) following glucocorticoid induction in rat thymocytes 
(GenBank accession number M80601). This match** had a Poisson p-value of 0.0036 for a BLASTP search of the NCBI non-redundant 
database of 13th September 1993. *, Identical amino acids; I, Conserved Cys or His residues. Also shown is a sample of the class of 
zinc-fingers that occur in the DNA binding domain of the steroid receptor family*, indicating a suggestive similarity (which is not statistically 
significant by pairwise alignment statistics and would require experimental confirmation) in the positions of most of the Cys or His residues. 

Before low-complexity filtering, MTG8 generated an output list from the NCBI non-redundant database of greater than 400 Kbytes 
containing 699 database sequences scoring above the BLASTP default threshold. The significant match to apoptosis protein was an 
inconspicuous 62nd in this list and scored much lower than many spurious low-complexity matches. After masking of MTG8 as in (b), this 
match was 6th in a list of 83 sequences. The latter list contained many matches to a "medium complexity" region of MTG8 which is 
tentatively predicted to be alpha helical coiled coil (residues 416-476). Further filtering with SEG at lower stringency (K < 0.365 for a 
14-residue window) effectively masked this region, and resulted in a BLASTP output list of only 9 sequences, in which the apoptosis protein 
was ranked in score only below the MTG8 self-matches and the match to TFIID 110-kDa subunit. 
Box 2 Low-complexity sequences and short-period tandem repeats 

To study low-complexity sequences and short-period tandem repeats, we first consider 
sequences as mixtures of regions with unknown statistical properties and then attempt to infer 
these properties. In order to put all possible low-complexity segments on an equal footing, we 
define local compositional complexity ignoring prior probabilities for the 20 amino acids or 4 
nucleotides^^^^'^^. Complexity is a function of the compositional state of a sequence segment or 
window. For example, the numbers (3,2,2,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0), representing, in 
decreasing order of abundance, counts for the various amino acids, describe one of the 77 
possible complexity states of a 12-residue peptide window. Many possible sequences and 
amino acid compositions, with different residue types corresponding to the 20 numbers, share 
this complexity state. Formally, we define the local compositional complexity K of a sequence 
window of length L, as 



$$K = \frac{1}{L}\log_{20}\left(\frac{L!}{\prod_{i=1}^{20} n_i!}\right)$$

where the n_i are the 20 numbers in the compositional state vector described above. Analogous 
to the enumeration of microstates in statistical mechanics, K measures the information per 
position needed, given the window's composition, to specify a particular residue order. 
Assuming uniform prior probabilities for the appearance of the various residues, the probability 
P for the occurrence of a given compositional state is 

$$P = \frac{1}{20^{L}}\cdot\frac{L!}{\prod_{i=1}^{20} n_i!}\cdot\frac{20!}{\prod_{k=0}^{L} r_k!}$$

where r_k is the count of the number of times the number k occurs among the n_i. K and P are 
functions of only the complexity state vector; they do not depend on which amino acids 
correspond to the 20 numbers in the vector or on the actual probabilities of the various amino 
acids. For the DNA alphabet, 4 replaces 20 in the above equations^^^^. 
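
The two quantities defined above are straightforward to compute. The following Python sketch is a minimal illustration of the definitions and is not the SEG program; the function names are ours. It also enumerates the complexity states of a 12-residue window, which provides a check on the count of 77 states quoted above.

```python
import math
from collections import Counter


def complexity_state(window, alphabet_size=20):
    """Residue counts in decreasing order, padded with zeros to alphabet_size entries."""
    counts = sorted(Counter(window).values(), reverse=True)
    return counts + [0] * (alphabet_size - len(counts))


def complexity_K(state):
    """K = (1/L) * log_N(L! / prod n_i!), where N = len(state) and L = sum(state)."""
    L = sum(state)
    arrangements = math.factorial(L)
    for n in state:
        arrangements //= math.factorial(n)   # each partial quotient is an integer
    return math.log(arrangements, len(state)) / L


def state_probability(state):
    """P = N^-L * (L! / prod n_i!) * (N! / prod r_k!), assuming uniform residue priors."""
    N, L = len(state), sum(state)
    sequences = math.factorial(L)
    for n in state:
        sequences //= math.factorial(n)
    assignments = math.factorial(N)
    for r in Counter(state).values():        # r_k: how many of the n_i equal k
        assignments //= math.factorial(r)
    return sequences * assignments / N ** L


def partitions(total, max_part=None):
    """Yield the partitions of `total` as lists of positive integers in decreasing order."""
    if max_part is None:
        max_part = total
    if total == 0:
        yield []
        return
    for first in range(min(total, max_part), 0, -1):
        for rest in partitions(total - first, first):
            yield [first] + rest


if __name__ == "__main__":
    state = complexity_state("AAACCDDEFGHI")   # counts 3,2,2,1,1,1,1,1
    print(state, complexity_K(state), state_probability(state))
    states = [part + [0] * (20 - len(part)) for part in partitions(12)]
    print(len(states), sum(state_probability(s) for s in states))
```

Because every 12-residue sequence belongs to exactly one complexity state, the probabilities of all states sum to 1, which the last line of the example confirms numerically.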

SEG^^ is an optimal segmentation algorithm based on the theory described above. It 
identifies, at a defined level of stringency, all the low-complexity segments in a sequence that 
minimize P within a local region of low K. A similar approach may be used to identify tandemly 
repeated segments of any defined period; methods for the purpose are under development. A 
heuristic algorithm, XNU", for identifying and masking short-period repeats finds self-matching 
segments that yield high PAM or BLOSUM scores when offset by a small number of residues, 
regardless of local compositional complexity. With appropriate parameterization, XNU and SEG 
are complementary. 

Programs such as SEG and XNU may be used to mask appropriate query sequence segments 
prior to database searching, replacing the residues in these segments by "x" characters (see Fig. 
2c). The score for "x" in each row or column of a PAM or BLOSUM amino acid matrix may be 
calculated as the mean of the 20 residue-pair scores in that row or column, ensuring that the 
impact of the masking character on the distribution of matching segment scores is minimal. □ 



calculated for any desired amount of evolutionary change. 
The details of the PAM model have been criticized*'^, and 
the vast increase in available sequence data has prompted 
recalculation of the model's parameters^'^'^l. Scores for 
DNA sequence comparison based on a PAM-like 
mutational model have also been described^. A different 
approach to estimating appropriate target frequencies 
relies not on fitting an evolutionary model, but rather on 
the direct observation of relatively distant, but 
nevertheless presumed largely correct, sequence 
alignments'^. A variety of empirical tests have been 
claimed to support the superiority of the resulting 
"BLOSUM" matrices for detecting sequence 
homology*^'^^. Lacking an evolutionary model, however, 
this approach is less adaptable to generating matrices 
tailored to specific applications^ ^. 

The theory linking substitution matrices with target 
frequencies is rigorously established only for local 
alignments lacking gaps. Therefore the development above 
is generally valid only for the BLAST and related 
algorithms^*^'*^^. A more general theory for alignments 
with gaps should, however, have the same broad 
outlines^**'^^, and target frequency based substitution 
matrices are perhaps nearly optimal for this more general 
case. Gapped alignments present the additional problem 
of choosing appropriate gap costs"^^. The simplest 
algorithms require these costs to be a linear function of 
gap length''^"^^, but efficient algorithms for more general 
gap costs are also available^'. Because no theory exists, 
appropriate gap costs have generally been chosen by trial 
and error, although there have been some recent efforts to 
give this problem a sounder empirical footing^^'^^. 

The user of database search programs should recognize 
that the default substitution scores and, where applicable, 
gap costs, have generally been chosen to be appropriate 
for the most frequent sort of query. These scores may not, 
however, be optimal for a specific problem. In particular, 
matrices such as PAM-120 or BLOSUM-62 (the current 
BLASTP default)^' are tailored for alignments of 
moderately diverged sequences. Very strong but short 
similarities, or very long but weak ones may easily be 
missed by these matrices^~\. A fully functional database 
search system should therefore provide a range of scoring 
systems to its users, so that the algorithm can be adapted 
to the problem at hand. 


Databases and access 

The most important requirement for 
database searching is a 
comprehensive, up-to-date database. 
Full releases of GenBank® now occur 
every two months, and daily updates 
are available for downloading or direct searching by e-mail 
and network services^"^. GenBank has undergone a 
major expansion in data coverage and now includes, in 
addition to nucleotide sequences, data from the major 
protein sequence and protein structure databases, as well 
as data from U.S. and European patents^*. Approximately 
36% of the records in GenBank are produced by the 
international collaborators, EMBL Data Library^^ and the 
DNA Database of Japan (DDBJ), with whom database 
updates are exchanged daily. Copies of the databases are 
available at many sites worldwide^^'^^ 

GenBank (release 80.0) contains 164 megabases of 
sequence and is doubling in size every 21 months (D. 
Benson, personal communication). This rate can only 
increase as a result of genome projects and automated 
sequencing technology. As mentioned above, special 
purpose computers have a role in maintaining reasonable 
search performance in the wake of this data deluge, but 
considerable improvements in search efficiency can be 
obtained by considering the nature of the data itself. 

Many sequence databases have a large degree of internal 
"redundancy" for historical reasons related to available 
technology and research trends, and also due to the 
existence of clusters of closely-related sequences from 
multigene families. Also, equivalent gene products have 
frequently been sequenced in a number of different species 
or organisms. In release 36.0 of PIR International, for 
example, there were 653 members of the globin 
superfamily, 349 cytochromes c, 583 sequences with 
immunoglobulin domains and 274 protein kinases. 
Considering only perfectly matching sequences, among 
the 52,257 protein sequences in this database, there are 
over 3,900 duplicate entries and over 3,800 perfect 
substrings of longer entries that together comprise about 
10% of the total amino acid residues. Among nucleic acid 
sequences there are thousands of Alu variants in GenBank. 
And the problem of redundancy is only getting worse: as 
a result of projects designed to sample expressed genes 
rapidly, tens of thousands of sequence fragments are 
being added to the databases; many of these sequences 
represent small pieces of known genes. Due to the error-prone 
nature of these sequence fragments, identifying 
redundancy in these collections is a more difficult task. 

As well as decreasing the speed of database searches, 
redundancy can obscure novel matches in the output, by 
yielding slews of similar or identical alignments. Practically, 
there are two simple ways to avoid this problem: i) 
construct a smaller "nonredundant" database; ii) 
preprocess the query sequence for the presence of known 
domains and mask these prior to searching. (The concept 
of query masking is discussed in the next section.) 

NCBI maintains two quasi-nonredundant sequence 
collections (NRDB), one for proteins and one for nucleic 
acids. For example, the protein NRDB is constructed 
iteratively starting with SWISS-PROT, which is the 
smallest and least redundant of the major protein 
databases. All of the proteins in PIR International are 
compared to those in SWISS-PROT, and identical 
sequences are excluded from the former while maintaining 
pointers to relevant annotation. Next, all of the protein 
translations from GenBank coding sequences ("GenPept") 
are compared to the merged SWISS-PROT plus PIR. 
Likewise, protein sequences from the Brookhaven 
structure database (PDB) and other sources are 
incorporated into NRDB. (The OWL nonredundant 
sequence database is constructed from the same sources.) 
This simple procedure reduces the size of the combined 
databases by 50%, yet ensures that all sequences are 
represented. More sophisticated methods for creating 
derived, composite views of protein and DNA sequence 
data promise even further reductions. 
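
The iterative merging procedure described above can be sketched as follows. This is a minimal Python illustration, not the NRDB software itself: the function names, the in-memory dictionaries and the toy sequences are all hypothetical. The second function, which reports sequences contained verbatim within longer entries, is an added convenience for illustrating the "perfect substring" statistic and uses a simple quadratic comparison that would be too slow for a full database.

```python
def build_nonredundant(sources):
    """Merge sequence collections in priority order, collapsing exact duplicates.

    sources: list of (db_name, {accession: sequence}) pairs, highest priority
             first (for example SWISS-PROT, then PIR, then GenPept).
    Returns {sequence: [(db_name, accession), ...]}. The first pointer is the
    retained entry; later pointers keep the links to annotation elsewhere.
    """
    merged = {}
    for db_name, records in sources:
        for accession, seq in records.items():
            merged.setdefault(seq.upper(), []).append((db_name, accession))
    return merged

def perfect_substrings(merged):
    """Return sequences that occur verbatim inside a longer retained sequence."""
    seqs = sorted(merged, key=len, reverse=True)
    contained = set()
    for i, short in enumerate(seqs):
        if any(short in longer for longer in seqs[:i] if len(longer) > len(short)):
            contained.add(short)
    return contained

# Toy data; accessions and sequences are invented for illustration.
sources = [
    ("SWISS-PROT", {"P1": "MKTAYIAK"}),
    ("PIR", {"A1": "MKTAYIAK", "A2": "TAYI"}),
    ("GenPept", {"G1": "MSSQ"}),
]
nr = build_nonredundant(sources)
fragments = perfect_substrings(nr)
```

Keying the merged collection on the sequence itself means that each distinct sequence is stored once, while every source record that carries it remains reachable through its pointer list.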

Another key issue is access to the databases. Researchers 
may perform database similarity searches remotely by 
sending their queries, via electronic mail, to centralized 
"server" computers, where large and frequently updated 
databases are maintained, and where fast processors and 
sophisticated software are available. E-mail services of 
this sort have been available from various sources for 
several years. For example, NCBI provides the BLAST e-mail 
server (for more information, send a "help" message 
to the Internet address blast@ncbi.nlm.nih.gov), and 
EMBL provides Blitz (nethelp@embl-heidelberg.de). 
Additional sites and services are given in ref. 64. In addition 
to database search and retrieval services, such sites maintain 
repositories of public domain software and specialized 
datasets that may be accessed via "anonymous ftp" over 
the Internet. The existence of high-performance networks 
is also giving rise to a new generation of "client-server 
applications" that make possible direct, real-time user 
interactions with remote servers. NCBI's BLAST network 
service and Entrez retrieval system are two examples. For 
users of the many excellent commercial software packages 
for sequence analysis, we would anticipate the development 
of network client-server capabilities in the near future. 

Masking of low-complexity sequences 

Interspersed local regions of very simple amino acid 
composition are surprisingly abundant in protein 
sequences. Some of these regions are homopolymers or 
short-period repeats, but most are not periodic and appear 
as mosaics of predominantly one or a few types of residue. 
Their compositional bias is in marked contrast to the 
structural domains and motifs of globular proteins familiar 
from crystal and NMR structures. Based on a relatively 
stringent definition of low-complexity, more than half 
of the sequences in the database contain at least one such 
region, and 14% of the amino acids occur in clusters of 
highly biased local composition. Moreover, a large excess 
of "medium-complexity" regions may be defined using a 
less stringent definition of complexity: these are found in 
many recently-deduced protein sequences that lack true 
homologues and do not belong to the class of "ancient 
conserved sequences". Very little is known about the 
molecular structures, dynamics, interactions and evolution 
of most low- and medium-complexity protein segments. 



Fig. 3 The mouse protein Sos1 functions as a key intermediate in transmitting signals from receptor tyrosine kinases to ras via protein-protein interactions. 
Sos1 (PIR accession S21391) is a member of a family of ras guanine nucleotide-releasing proteins (GNRP) that also includes S. cerevisiae CDC25 and SDC25, S. 
pombe Ste6, and the Drosophila gene, Son of sevenless. Mouse Sos1 is a large, mosaic protein with several different domains, including a rasGNRP domain and 
a low complexity region that binds to an "adapter" protein called Grb2. a, Results of a BLASTP search using an Sos1 query sequence without any masking 
applied. In addition to several "self hits" in the output, we see significant matches to some S. cerevisiae proteins, but Ste6 does not appear in the top 25 matches 
despite its presence in the database (PIR International, release 37). Moreover, the true positive matches are interspersed with many false positives, consisting of a 
number of functionally unrelated proline-rich proteins. These artifactual matches are highly significant in the statistical sense, but a glance at some of the local 
alignments shows that one is not justified in inferring similar function despite the high scores and low p-values. b, An identical search, except that in this case the 
Sos1 query has been pre-processed using SEG masking with default parameters. Note that the top of the "hit list" is now populated only by bona fide members of 
the rasGNRP family and that all artifactual matches against proline-rich proteins have disappeared. Furthermore, a match to S. pombe Ste6 is now obvious; a local 
alignment between this protein and Sos1 is shown. Interestingly, Sos1 shows significant local similarities to histone H2A and β-spectrin (see below). c, Results of 
another search with masking of both low complexity regions (b) and the rasGNRP domain. The top four matches now consist only of those proteins that share 
more extensive, or global, similarity with the query beyond the rasGNRP domain. In this example, the additional information gained by this extra masking step is 
not striking. But one can imagine the dramatic effect this would have in shrinking the "hit list" if the query possessed a kinase domain, of which there are hundreds 
of examples in the database. (See ref. 74 for an example involving immunoglobulin domains.) d, The query sequence, mouse Sos1, annotated with the various 
domains identifiable by BLASTP searching. The rasGNRP domain is according to Boguski & McCormick. The proline-rich carboxy terminal region is known to 
interact with Src homology (SH3) domains in Grb2. With regard to the local similarities between Sos1 and histone H2A and β-spectrin, it has recently been shown 
that Sos1, β-spectrin and a number of other proteins possess "pleckstrin homology" or PH domains. The local alignment produced by BLASTP (c) corresponds 
to these PH domains. The similarity between Sos1 and histone H2A has not previously been reported and is difficult to interpret biologically. Nonetheless, the 
similarity is as significant as that of the PH domain and may have structural, as opposed to functional, implications. 
Low-complexity segments confound database search 
algorithms in two ways. First, most of these segments do 
not generally give meaningful alignments position by 
position in ways that reflect actual structure and mutational 
history: they evidently evolve relatively rapidly by processes 
such as replication slippage and repeat expansion. (At 
the DNA sequence level, trinucleotide and dinucleotide 
repeat polymorphisms provide a familiar example.) 
Permutations, shuffles or reversals of low-complexity 
amino acid sequences generally give alignment scores 
similar to those of the original sequence. Second, the residue 
compositions of low-complexity segments are very 
different from that of the database as a whole. This is 
evident if all low-complexity segments in the database are 
grouped into a single class: a strong excess of alanine, 
glycine, proline, serine, glutamate and glutamine results. 
However, this lumped class is itself heterogeneous, 
containing for example glutamine-rich and proline-rich 
subclasses. These statistical biases contrast with those that 
characterize the bulk of most query and database 
sequences, and on which score-based alignment statistics 
are founded. Thus the high scores of alignments of low- 
complexity segments are due primarily to their 
compositional biases and do not necessarily reflect 
significant positional similarity. 

Several classes of low-complexity residue clusters have 
been analysed for statistical significance by Karlin and 
coworkers. Their methods, which use the contrasting 
residue frequencies of specific clusters and those of 
complete proteins or databases, are embodied in the SAPS 
software. SEG, the algorithm employed by the BLAST 
programs for filtering low-complexity segments from 
query sequences prior to database searching (Figs 2 and 
3), employs instead optimal segmentation methods applied 
to a more general definition of compositional complexity 
(see Box 2). 
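
To make the idea of masking by compositional complexity concrete, the following Python sketch applies a much simpler filter than SEG: it slides a fixed window along the query and masks every window whose Shannon entropy falls at or below a threshold. It is not the SEG algorithm and performs no optimal segmentation. The function names, the window length of 12, the threshold of 2.2 bits and the toy query are assumptions chosen only for illustration.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Compositional entropy of a segment, in bits per residue."""
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in Counter(segment).values())

def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue that lies in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)

# Toy query: a proline homopolymer flanked by more varied sequence.
query = "MKTAYIAKQR" + "P" * 15 + "DLSHGWVENC"
masked = mask_low_complexity(query)
```

Because windows that only partly overlap the proline run can also fall below the threshold, a few flanking residues are masked along with the run itself. Refining such boundaries is one of the tasks that a segmentation method addresses.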

Masking of highly abundant sequences 

Database searching can be performed efficiently in phases, 
with a query first compared to a small database containing 
domains representative of large sequence families. 
Subsequences of a query that match one or more of these 
domains can then be masked prior to full-scale searching, 
thereby eliminating most of the redundant output. 
Annotated collections of prototypic human repetitive 
sequences, such as Alu, and protein kinase catalytic 
domains exist and can be used to pre-filter a query (Fig. 
3c). (Both of these data sets are available from the NCBI 
Data Repository on CD-ROM and by anonymous ftp. See 
/repbase/alu, /repbase/humrep and /pkinases/pkcdd.fa at 
ncbi.nlm.nih.gov.) For proteins, a more comprehensive 
solution to the problem is approached by building a small, 
representative set of protein superfamilies or motifs and 
using this as a screening database with automatic masking 
of matching query subsequences (unpublished results). 
This technology is still under development but recent 
studies indicate that a representative set of only 1,000-3,000 
sequences may suffice; such a database can be 
searched in seconds. The first large-scale implementation 
of this strategy has been performed for a specialized 
database of "expressed sequence tags" or ESTs, where 
such pre-filtering is also employed to detect contamination 
by vector sequences. 
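
The two-phase strategy described in this section can be sketched in Python as follows. For brevity the screening step uses exact substring matching as a stand-in for a real similarity search against the screening database; the function name, the screening entries and the query are hypothetical.

```python
def screen_and_mask(query, screening_db, mask_char="X"):
    """Mask query regions that match entries in a small screening database.

    screening_db: {name: sequence} of representative domains or repeats.
    Exact substring matching is used here purely as a placeholder for a
    similarity search. Returns the masked query and a list of
    (name, start, end) hits, 0-based and end-exclusive.
    """
    masked = list(query)
    found = []
    for name, domain in screening_db.items():
        start = query.find(domain)
        while start != -1:
            end = start + len(domain)
            found.append((name, start, end))
            masked[start:end] = mask_char * len(domain)
            start = query.find(domain, start + 1)
    return "".join(masked), found

# Toy example; names and sequences are invented for illustration.
screening_db = {"repeat1": "GPGPGP", "vector1": "HHHHHH"}
query = "MKTAYGPGPGPLLSDE"
masked_query, hits = screen_and_mask(query, screening_db)
```

The masked query can then be submitted to the full-scale search, and the list of hits records which known elements were removed so that they can still be reported to the user.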

Conclusions 

The stated goals of the U.S. Genome Project include the 
production of 50 megabases of DNA sequence data per 
year by 1998 and the identification and correlation of 
genes in humans and model organisms. Database 
similarity searching will be one of the major informatics 
tools used in this endeavor. Not only efficient algorithms, 
but also a choice of appropriate scoring systems, well- 
defined measures of statistical significance and a better 
understanding of the sequences themselves, are critical 
for the automated analysis schemes that this amount of 
data will inevitably require. 

Special purpose and faster general purpose computers 
will have roles in sifting through this increasing volume of 
sequence data. But large improvements in the efficiency 
of searching can be obtained by considering the nature of 
the data and implementing new strategies that capitalize 
on this knowledge. One of these strategies is to preprocess 
a query sequence to identify known domains and motifs, 
dispersed repeats, low complexity segments and other 
regions of compositional bias such as potential membrane-spanning 
and α-helical coiled-coil regions. We have 
described several preprocessing techniques that are suitable 
for automation and have demonstrated their practical 
utility with examples. Foreknowledge of query features 
enables one to perform faster and more effective searches 
and to evaluate search results better. 

Another, complementary strategy is to reduce the 
redundancy in the target database(s) to be searched. 
We have outlined one simple but useful approach to 
the reductive merging of diverse, but overlapping, 
source databases. But newer, cleaner and richer views 
of the sequence data, optimized for gene discovery, are 
on the horizon. 

Note added in proof: NCBI has recently established a 
GenBank® World Wide Web server (the URL is 
http://www.ncbi.nlm.nih.gov) that provides network access 
to many of the software tools and data sources described 
in this review. 